Abstract

Most pattern recognition tasks can be abstracted to a problem of utilizing
comparisons between objects to perform the given inference task. Often these
comparisons are in the form of a distance measure or dissimilarity. The design
of appropriate comparison functions for particular inference tasks is an area
of extensive research, and often rests on expert knowledge of the problem
domain. If the data of interest come from two different sensors, or consist of
very different types of data, a single dissimilarity may be inappropriate;
instead, one might utilize several dissimilarities, each designed for a
specific sensor or data stream. In this work we consider the problem of fusing
information obtained from very different sensors or sources, encoded through
the use of dissimilarity functions. Given $n$ observations from source $j$, we
have an $n \times n$ dissimilarity matrix $D_j$, and we wish to utilize all
this information in our inference. We describe several methods of utilizing
these dissimilarity matrices that are based on embedding the observations into
a single space. These methods optimize either the fidelity (whether the
distances in the embedded space match the original dissimilarities) or the
commensurability (whether matched objects from different sensors are close in
the embedded space) or both. We discuss the properties of these embeddings,
apply the idea to a problem in network modeling, and point out some
interesting areas of further research.


1  Introduction 

Consider  the  problem  of  fusing  the  information  from  two  sensors  in  order  to 
perform  a  given  inference.  In  the  case  we  consider  in  this  paper,  there  will  be 
two  different  sets  of  observations  from  two  different  sensors,  and  we  wish  to 
combine  the  observations  from  the  two  sensors.  We  will  consider  only  the  case 
of  two  sensors,  but  the  multiple  sensor  case  can  be  analyzed  similarly.  Much  of 
the  work  in  this  paper  has  been  reported  in  a  number  of  papers,  in  particular 
Priebe et al. [2010] and Marchette [to appear], although we will consider some new
results and provide some new insights into the methodologies.
Formally, let $\Xi$ be a space, and $\pi_0 : \Xi \to \Xi_0$, $\pi_1 : \Xi \to \Xi_1$ continuous
maps into two dissimilarity spaces. Recall that a dissimilarity space $\mathcal{X}$ is a
space in which a function $d : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is defined with the properties: 1)
$d(x, y) \ge 0$; 2) $d(x, y) = 0 \iff x = y$. The maps $\pi_i$ can be considered to be
the measurements from two sensors. Let $\rho_i$ be embeddings defined on $\Xi_i$, into
a space $\mathcal{X}$ (in this discussion $\mathbb{R}^d$ for a fixed $d$). Throughout this paper we will
assume the embeddings are performed through multidimensional scaling (MDS)
(Borg and Groenen [1997]) without explicitly defining which specific approach
to MDS is used.


[Diagram: $\Xi \xrightarrow{\pi_i} \Xi_i \xrightarrow{\rho_i} \mathcal{X} = \mathbb{R}^d$, for $i = 0, 1$.]


Let $X_n = \{x_1, \ldots, x_n\} \subset \Xi$ be observations in the original space, and denote
by $x_i^j$ the image of $x_i$ under $\pi_j$: $x_i^j = \pi_j(x_i)$. The function $\rho$ is an unknown
(and possibly fictitious) “manifold matching” function. Note that we assume we have
no way to observe $\Xi$ directly; we can only observe the images under the $\pi_i$. We also
assume that we have no idea what the “manifold matching” function $\rho$ is, and
in fact in many instances of interest this function may not even exist; it is
only notional. Thus, this work is not aimed at trying to estimate $\rho$ (although,
see Marchette [to appear]).

Our methodologies will all start with dissimilarities $d_i$ defined on $\Xi_i$. We will
denote by $\Delta^i$ the $n \times n$ interpoint dissimilarity matrix on the $\Xi_i$ portion of $X_n$.
An  obvious  approach  is  to  embed  each  sensor  into  a  separate  space,  and  then 
form  the  product:  essentially  forming  the  union  of  the  features  from  the  two 
sensors.  This  makes  no  use  of  any  redundancy  or  correlation  of  the  information 
in  the  two  sensors;  however,  when  the  sensors  are  sufficiently  different  then  an 
approach  like  this  might  well  be  what  is  required.  We  will  discuss  below  some 
ways  of  investigating  whether  this  might  be  the  case. 

2  A  Separate  Embedding  Approach 


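[Diagram: separate embeddings $\rho_0$ and $\rho_1$ into $\mathbb{R}^d$, with the two copies of $\mathbb{R}^d$ related by the map $Q$.]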


The idea of this is to define the embeddings $\rho_i$ independently, then define a
mapping $Q$ that maps the $x_i^0$ as close as possible to the matching $x_i^1$. This is
usually performed by Procrustes (see footnote 1):

$$\arg\min_{Q^T Q = I} \| X_1 - X_0 Q \|,$$

where the rows of $X_k$ are the embedded points from sensor $k$.
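
As a minimal sketch of the Procrustes step, assume the embedded points are stored as the rows of two NumPy arrays (the names `X0`, `X1` and `procrustes_rotation` are hypothetical), that the norm is the Frobenius norm, and that no scaling or translation is needed. Under those assumptions the orthogonal minimizer has a closed form via the singular value decomposition:

```python
import numpy as np

def procrustes_rotation(X0, X1):
    """Return the orthogonal Q minimizing ||X1 - X0 @ Q|| in Frobenius norm."""
    # Closed form: if X0.T @ X1 = U S V^T, the minimizer is Q = U V^T.
    U, _, Vt = np.linalg.svd(X0.T @ X1)
    return U @ Vt

# Illustrative data only: X1 is an exactly rotated copy of X0.
rng = np.random.default_rng(0)
X0 = rng.normal(size=(20, 3))
Q_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
X1 = X0 @ Q_true
Q = procrustes_rotation(X0, X1)
```

In this noise-free illustration the recovered `Q` maps `X0` exactly onto `X1`; with real embeddings the residual is the commensurability error left after the rigid motion.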

We  define  the  fidelity  of  the  embedding  in  terms  of  raw  stress  (see  Borg  and 
Groenen  [1997]  for  discussion  of  this  and  other  criteria  that  might  be  used  in 
its  place).  The  fidelity  measures  how  well  the  embedded  points  match  their 
respective  dissimilarities: 

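$$\epsilon_f^k = \sum_{i=1}^{n-1} \sum_{j>i} \left( d(\tilde{x}_i^k, \tilde{x}_j^k) - \Delta^k_{ij} \right)^2, \qquad k = 0, 1,$$

where $\tilde{x}_i^k$ denotes the embedded image of $x_i^k$.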

It is well known that MDS minimizes the fidelity error (see footnote 2). Thus, the separate
embedding approach optimizes the fidelity of each embedding separately. This
does  not  guarantee  that  the  resulting  two  point  sets  are  commensurate.  The 
Procrustes  embedding  is  designed  to  optimize  this  commensurability,  under  the 
rigid  motion  constraint,  which  ensures  that  the  fidelity  is  retained.  We  define 
the  commensurability  error  as: 

$$\epsilon_c = \sum_{i=1}^{n} d(\tilde{x}_i^0, \tilde{x}_i^1)^2.$$

In this definition we abuse notation by using the same symbol $\tilde{x}_i^0$ to denote the
image of $\tilde{x}_i^0$ under the Procrustes transformation. This allows us to refer to the
commensurability  as  a  criterion  on  the  ultimate  embedded  points  regardless  of 
how  the  embedding  is  performed. 
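
The two criteria can be written down directly. The sketch below assumes Euclidean distance in the embedded space, raw stress summed over pairs $i < j$, and commensurability taken as the sum of squared distances between matched points; the function names are hypothetical.

```python
import numpy as np

def pairwise_dist(Z):
    """Euclidean interpoint distance matrix for the rows of Z."""
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def fidelity_error(Z, Delta):
    """Raw stress: squared mismatch between embedded distances and Delta."""
    D = pairwise_dist(Z)
    iu = np.triu_indices(Z.shape[0], k=1)  # each pair i < j counted once
    return ((D[iu] - Delta[iu]) ** 2).sum()

def commensurability_error(Z0, Z1):
    """Sum of squared distances between matched embedded points."""
    return ((Z0 - Z1) ** 2).sum()

# Two points at distance 5, compared with a matching and a mismatched Delta.
Z = np.array([[0.0, 0.0], [3.0, 4.0]])
exact = fidelity_error(Z, np.array([[0.0, 5.0], [5.0, 0.0]]))
off_by_one = fidelity_error(Z, np.array([[0.0, 4.0], [4.0, 0.0]]))
mismatch = commensurability_error(np.array([[0.0, 0.0], [1.0, 0.0]]),
                                  np.array([[0.0, 1.0], [1.0, 0.0]]))
```

Because both functions take only the final embedded points, they apply regardless of how the embedding was produced, which is the point of the notational convention above.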

Similarly, a canonical correlation approach can be used to optimize
commensurability without regard to fidelity. For the details see Priebe et al. [2010].

3  Joint  Embedding 

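Define $\Delta^\lambda = \lambda \Delta^1 + (1 - \lambda) \Delta^0$. The joint embedding approach we consider
utilizes the three interpoint dissimilarity matrices, $\Delta^1$, $\Delta^0$ and $\Delta^\lambda$ for a fixed
$\lambda \in [0, 1]$ (usually $\lambda = 1/2$). Form the $2n \times 2n$ omnibus dissimilarity matrix:

$$\Delta = \begin{pmatrix} \Delta^0 & W \\ W^T & \Delta^1 \end{pmatrix}, \qquad W = \Delta^\lambda.$$

Then use $\Delta$ to define the embeddings.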

Footnote 1: For simplicity we are assuming that the dissimilarities are on the same scale, so that no
scaling is required of the Procrustes transformation. We also assume that they are centered,
so that we do not have to translate (although both of these can easily be incorporated in the
Procrustes methodology).

Footnote 2: Different methods of MDS have been developed for minimizing various criteria. As
discussed above we focus on raw stress for ease of exposition.
It should be noted that this approach assumes that the dissimilarity matrices
are on the same scale. In the separate embedding approach one can incorporate
scale into the Procrustes transformation, but the joint embedding method
should be used on matrices that have been scaled appropriately. How best to
do this is a topic for another time.

We  are  optimizing: 

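$$\epsilon = \sum_{i<j} \left( d(\tilde{x}_i^0, \tilde{x}_j^0) - \Delta^0_{ij} \right)^2 + \sum_{i<j} \left( d(\tilde{x}_i^1, \tilde{x}_j^1) - \Delta^1_{ij} \right)^2 + \sum_{i,j} \left( d(\tilde{x}_i^0, \tilde{x}_j^1) - W_{ij} \right)^2.$$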

In this, $W$ is taking the place of $d(\tilde{x}^0, \tilde{x}^1)$, which is unknown and, in some
applications, may not even be meaningful. Note that we impute the diagonal
of  W  to  be  0.  Given  our  notion  that  matched  documents  are  “the  same”,  it 
is  reasonable  that  their  distance  should  be  small,  and  this  choice  attempts  to 
force  this.  However,  we  know  experimentally  that  this  is  not  the  optimal  choice 
for  the  diagonal  of  W.  Clearly  the  choice  of  this  matrix  is  an  area  for  future 
research. 

The  diagonal  of  the  third  term  is  the  commensurability  error.  We  call  the 
off-diagonal  term  the  separability  error.  That  is,  we  also  want  to  ensure  that 
non-matched  pairs  are  appropriately  far  apart. 
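
A minimal sketch of the joint embedding follows. Two points are assumptions rather than the method as stated: classical (eigendecomposition-based) MDS is swapped in for the raw-stress MDS discussed in the text, purely because it is short to write in NumPy, and the off-diagonal block is taken to be $W = \lambda \Delta^1 + (1 - \lambda) \Delta^0$. The function names are hypothetical.

```python
import numpy as np

def classical_mds(D, d):
    """Classical MDS: embed a dissimilarity matrix D into R^d."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # doubly centered squared dissimilarities
    vals, vecs = np.linalg.eigh(B)
    top = np.argsort(vals)[::-1][:d]         # indices of the d largest eigenvalues
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def joint_embedding(D0, D1, d, lam=0.5):
    """Embed both sensors at once through the 2n x 2n omnibus matrix."""
    n = D0.shape[0]
    W = lam * D1 + (1 - lam) * D0            # imputed cross-sensor dissimilarities
    M = np.block([[D0, W], [W.T, D1]])
    Z = classical_mds(M, d)
    return Z[:n], Z[n:]                      # sensor-0 points, sensor-1 points

# Sanity check: identical sensors should give coincident matched points.
rng = np.random.default_rng(0)
P = rng.normal(size=(6, 2))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
Z0, Z1 = joint_embedding(D, D, d=2)
D_hat = np.linalg.norm(Z0[:, None, :] - Z0[None, :, :], axis=-1)
```

When the two sensors agree exactly, the matched points coincide and the original distances are recovered, so both the fidelity and commensurability errors vanish; the interesting behavior arises when the sensors disagree.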


4  How  Can  This  Fail? 

Jointly optimizing fidelity and commensurability seems to be an excellent
strategy, but one might wonder if there are cases in which it cannot be effective.
Priebe et al. [2010] give a brief discussion of this in terms of Hausdorff distance.
Consider the case where the spaces are Euclidean and the maps $\pi_i$ are linear
projections onto subspaces. The Hausdorff distance between two subspaces
is $2 \sin(\theta/2)$, where $\theta$ is the canonical angle between the subspaces. Essentially,
this says that if the subspaces are too far apart (orthogonal, in the extreme case), the points will
not be commensurate. To turn this around, highly incommensurate points in
the training data are indicators that the embedding approach proposed may be
inappropriate.
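
As a small illustration of the quantity $2 \sin(\theta/2)$, the sketch below computes canonical angles from the singular values of the product of orthonormal bases. Taking $\theta$ to be the largest canonical angle is an assumption made here for subspaces of dimension greater than one; the function names are hypothetical.

```python
import numpy as np

def largest_canonical_angle(A, B):
    """Largest canonical (principal) angle between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)                  # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)                  # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the canonical angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines.min(), -1.0, 1.0))

def subspace_hausdorff(A, B):
    """The distance 2 sin(theta / 2) for the largest canonical angle theta."""
    return 2.0 * np.sin(largest_canonical_angle(A, B) / 2.0)

e1 = np.array([[1.0], [0.0], [0.0]])
e2 = np.array([[0.0], [1.0], [0.0]])
orthogonal_case = subspace_hausdorff(e1, e2)   # theta = pi / 2
identical_case = subspace_hausdorff(e1, e1)    # theta = 0
```

For orthogonal lines the value is $2 \sin(\pi/4) = \sqrt{2}$, the extreme case in which the projections share no information.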


5  Experimental  Results 

We  apply  the  approach,  suitably  modified,  to  a  problem  in  modeling  random 
graphs.  The  model  we  investigate  is  the  random  dot  product  graph  (RDPG) 
model  (see  Marchette  and  Priebe  [2008]),  in  which  each  actor  in  a  social  network 
has  a  vector  of  attributes,  and  the  edge  probabilities  are  a  function  of  the  dot 
product  of  these  vectors.  In  this  problem  we  have  external  measurements  related 
to  these  vectors,  and  wish  to  use  this  extra  information  to  improve  the  model 
fit.  In  this  case  the  joint  embedding  approach  is  very  well  suited,  and  produces 
a  gratifyingly  large  improvement  in  fit. 

The random dot product graph (RDPG) model is a simple model of social
networks that relates attributes of the actors (vertices) to their social
relationships (edges). Each vertex $v$ is given a random vector $X_v$, and the probability
of an edge between $u$ and $v$ is given by:

$$P[u \sim v] = X_u^T X_v.$$

The edges are conditionally independent given the $X$’s. Given a graph $G$ on
$n$ vertices (all graphs will be simple, with no self-loops or multiple edges, and
undirected), we wish to fit the model: find $X$ that “best” fits the graph. We will
use the convention that $X$ and its estimates are $n \times d$ matrices.

We  define  best  in  terms  of  the  Frobenius  norm:  given  the  adjacency  matrix 
A  of  the  graph,  minimize 

$$\| A - X X^T \|_F^2.$$

Thus, we are considering squared error as our criterion. Note that we can solve
this easily via spectral methods: the optimal solution is available through the
eigenvalues and eigenvectors of $A$. Note further, however, that this is not quite
what we want, since the diagonal of the adjacency matrix should be ignored. We
thus augment the diagonal with an estimate of the norm of the $X_v$: it can be
shown that the expected degree of a vertex is $E\,d_v = (n - 1)\,E\|X_v\|$ (this is exactly
analogous to the similar formula for Erdős-Rényi graphs). Define

$$\check{A} = A + \mathrm{diag}\left( \frac{d_v}{n - 1} \right),$$

and now minimize

$$\| \check{A} - X X^T \|_F^2.$$

The optimum is found as $\hat{X} = U \Lambda^{1/2}$, where $\Lambda$ is the diagonal matrix containing
the $d$ largest eigenvalues of $\check{A}$ and $U$ is the $n \times d$ matrix formed from the
corresponding eigenvectors.
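
The spectral fit can be sketched in a few lines of NumPy. The function names are hypothetical, the diagonal is augmented with degree divided by $n - 1$ as described above, and clipping negative eigenvalues to zero before taking square roots is an implementation choice made here, not something the text specifies.

```python
import numpy as np

def augment_diagonal(A):
    """Add degree / (n - 1) to the diagonal of the adjacency matrix A."""
    n = A.shape[0]
    degrees = A.sum(axis=1)
    return A + np.diag(degrees / (n - 1))

def spectral_embed(M, d):
    """Scale the eigenvectors of the d largest eigenvalues by their square roots."""
    vals, vecs = np.linalg.eigh(M)           # M is assumed symmetric
    top = np.argsort(vals)[::-1][:d]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def rdpg_fit(A, d):
    """Spectral estimate of the n x d latent position matrix."""
    return spectral_embed(augment_diagonal(A), d)

# Complete graph on 4 vertices: every degree is 3, so the augmented
# matrix is the all-ones matrix, which has rank one.
A = np.ones((4, 4)) - np.eye(4)
A_check = augment_diagonal(A)
X_hat = rdpg_fit(A, d=1)
```

The estimate is only determined up to an orthogonal transformation (here, a sign), which is why the error criterion used in the experiments compares $XX^T$ rather than $X$ itself.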

Now,  suppose  we  observe  attributes  Y  for  the  vertices  that  are  correlated 
with  the  model  parameters  X.  Can  we  use  these  to  obtain  an  improved  fit  to 
the  graph?  The  answer  is  investigated  in  the  next  set  of  experiments.  In  all 
experiments,  n  =  100  (the  number  of  vertices  in  the  graph)  and  we  run  each 
experiment  N  =  100  times  to  obtain  the  box-plots  in  the  Figures  below. 

Given a $(d + 1)$-dimensional vector $X$, define $X^-$ as the $d$-dimensional
vector consisting of the first $d$ components of $X$. In our experiments we will take
$d = 3$. Let $p_1 = (0.0, 0.4, 0.1, 0.5)$ and $p_2 = (0.6, 0.1, 0.1, 0.1)$; then in our
experiments half of the vertices are distributed $X \sim \mathrm{Dirichlet}(100 p_1 + 1)^-$, and
half distributed $X \sim \mathrm{Dirichlet}(100 p_2 + 1)^-$. These result in small clouds of
points around the centering points $p_i$. Let $G \sim \mathrm{RDPG}(X)$ denote the RDPG
model defined by $X$. We will consider two choices for $Y$ in the following
experiments: 1) $Y \sim \mathrm{Dirichlet}(rX + 1)$ (here $r$ is a parameter corresponding roughly
to the inverse of the variance), 2) $Y \sim \mathrm{Dirichlet}(rX + 1)^-$. So in the second,
the dimension of $Y$ is $d' = 2$. In all experiments, we use as error the difference
between the estimated probability and the true: $\mathrm{Error} = \| X X^T - Z Z^T \|_F^2$ for
estimate $Z$.
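
The data-generating setup can be sketched as follows. This is an illustration under stated assumptions, not the authors' simulation code: the function names are hypothetical, the seed and the value $r = 10$ are arbitrary, and each attribute vector is drawn row by row from a Dirichlet distribution with parameter $r X_v + 1$.

```python
import numpy as np

def sample_latent_positions(n, rng):
    """Half the rows near p1, half near p2, keeping the first d = 3 components."""
    p1 = np.array([0.0, 0.4, 0.1, 0.5])
    p2 = np.array([0.6, 0.1, 0.1, 0.1])
    half = n // 2
    X_full = np.vstack([
        rng.dirichlet(100 * p1 + 1, size=half),
        rng.dirichlet(100 * p2 + 1, size=n - half),
    ])
    return X_full[:, :-1]

def sample_rdpg(X, rng):
    """Simple undirected graph with edge probabilities given by dot products."""
    n = X.shape[0]
    P = X @ X.T
    upper = np.triu(rng.random((n, n)) < P, k=1).astype(int)  # no self-loops
    return upper + upper.T

def sample_attributes(X, r, rng, drop_last=False):
    """Case 1: Dirichlet(r X_v + 1); case 2 also drops the last coordinate."""
    Y = np.array([rng.dirichlet(r * x + 1) for x in X])
    return Y[:, :-1] if drop_last else Y

def fit_error(X, Z):
    """Squared Frobenius error between true and estimated edge probabilities."""
    return np.linalg.norm(X @ X.T - Z @ Z.T, "fro") ** 2

rng = np.random.default_rng(1)
X = sample_latent_positions(100, rng)
A = sample_rdpg(X, rng)
Y_case1 = sample_attributes(X, r=10, rng=rng)
Y_case2 = sample_attributes(X, r=10, rng=rng, drop_last=True)
```

Because the rows of `X` are truncated Dirichlet draws, they are nonnegative with sum at most one, so every dot product is a valid probability.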


[Figure 1: two box-plot panels. Vertical axis: Error. Horizontal axis tick labels: 100, 10, 5, 1, 0.1, 0.01, 0.]

Figure 1: RDPG experiments. In increasing darkness of the boxes, these
correspond to the original estimate $\hat{X}$ from the graph alone; the estimate given
by $\hat{Y}$ alone; the fusion result; the estimate given by the average of $\hat{X}$ and $Q\hat{Y}$,
with $Q$ the Procrustes transformation to map these together. In the left plot,
$d = d' = 3$ and in the right $d' = 2$.


Denote by $\hat{Y}$ the observed value of $Y$. Given an estimate $\hat{X}$ of $X$ (say using
the above spectral algorithm) we can use the Procrustes transform to define $Q$
to map $\hat{X}$ and $\hat{Y}$ together, giving one estimate as the average: $(\hat{X} + Q\hat{Y})/2$. $\hat{Y}$
gives us a third estimate (ignoring the graph altogether), and a fourth is through
the joint embedding as follows. Let $B = \hat{Y}\hat{Y}^T$, $W = (\check{A} + B)/2$, and form

$$E = \begin{pmatrix} A & W \\ W & B \end{pmatrix}.$$


Treat this as in the spectral algorithm above, obtaining a $2n \times d$
matrix, and return the average of the top $n$ vectors with the bottom $n$ vectors
(pairing the $i$th vector with the $(n + i)$th). For case 1 above, the results are
shown in Figure 1 (left), and for case 2, in Figure 1 (right). As $r$ increases,
$Y_v$ has less and less variance, and as $r$ decreases the variance of $Y_v$ increases, until
at $r = 0$ $Y$ is uniform in the simplex, independent of the corresponding $X$
value. Note that this approach, while not a dissimilarity approach, is actually
quite similar to the joint embedding approach we discussed above. Classical
multidimensional scaling essentially performs the spectral estimate on a matrix
related to $E$ above: a centering of the squared dissimilarity matrix that moves
from dissimilarity space into dot product space, if you will. Thus, we view this
approach to RDPG as all part of the same set of algorithms.
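
The joint embedding estimate for the RDPG can be sketched as below. The function names are hypothetical, and one reading of "treat this as in the spectral algorithm" is assumed: the caller passes the diagonal-augmented adjacency matrix, which is used both in the top-left block and in $W$. The sanity check replaces the graph with the noise-free probability matrix, which is purely illustrative.

```python
import numpy as np

def spectral_embed(M, d):
    """Scale the eigenvectors of the d largest eigenvalues by their square roots."""
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(vals)[::-1][:d]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def joint_rdpg_estimate(A_check, Y, d):
    """Embed the 2n x 2n block matrix and average the matched halves."""
    n = A_check.shape[0]
    B = Y @ Y.T                              # dot products of the attributes
    W = (A_check + B) / 2                    # imputed cross block
    E = np.block([[A_check, W], [W, B]])
    Z = spectral_embed(E, d)
    return (Z[:n] + Z[n:]) / 2               # pair row i with row n + i

# Sanity check: with the true probability matrix in place of the graph and
# noise-free attributes, the estimate reproduces the probabilities.
rng = np.random.default_rng(2)
X = rng.dirichlet(np.ones(4), size=30)[:, :3]
P = X @ X.T
Z = joint_rdpg_estimate(P, X, d=3)
```

Both halves of the embedding come from the same eigenvectors, so they are already expressed in a common coordinate system and can be averaged without a separate Procrustes step.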

Note  that  for  case  1,  a  simple  averaging  of  X  and  Y  is  possible  (assuming 
that  we  have  performed  a  Procrustes  to  ensure  that  these  are  commensurate) 
and  this  is  depicted  in  the  figure  as  the  darkest  box.  The  notches  on  the  boxes 
give  an  idea  of  significance:  if  the  notches  overlap,  the  two  approaches  cannot  be 
considered significantly different. For this particular simulation (and our goal
of investigating the properties of our method) the case $r = 1$ is the sweet spot:
the joint embedding approach is the best by far, and yet the noise on the $Y$ is
so extreme that it is itself a terrible estimate of the model’s vectors.

Figure 2: Third RDPG experiment. Comparison of the estimate of $X$ using the
graph only (light gray) and the estimate obtained by adding noise to the graph.
See the text for details.

In  the  second  experiment,  we  do  not  have  the  option  of  averaging  (since  we 
assume  we  do  not  know  what  combination  of  the  three  coordinates  of  X  we 
are  measuring  with  Y)  and  so  we  only  plot  the  results  for  the  two  individual 
approaches  and  the  joint  embedding.  The  results  are  given  in  Figure  1  (right). 
Clearly  the  combination  of  the  information  is  superior  to  the  separate  ones. 

Note  that  the  joint  embedding  approach  is  relatively  insensitive  to  noise  - 
in  fact  it  seems  to  ignore  the  attributes  for  r  =  0.  Adding  noise  can  (somewhat 
counter-intuitively)  sometimes  improve  estimates,  and  to  some  degree  that  is 
what  is  happening  in  the  r  =  1  case. 

To better understand this phenomenon, consider Figure 2. We perform the
following experiment: Let both $X$ and $Y$ be drawn uniformly in the simplex,
independently. Thus there is no “signal” in $Y$. Form the RDPG graph $G$ from
$X$ and let $A$ be its adjacency matrix as above. For $\lambda \in [0, 1]$, form
$B_\lambda = \lambda A + (1 - \lambda) Y Y^T$. Then use the spectral approach on $B_\lambda$ to obtain the estimate
$\hat{X}_\lambda$. We plot the errors of this approach compared to the estimate which uses
$A$ only (as in the previous plots) in Figure 2. Note first that, as expected, when
$\lambda = 0$ the estimate is bad (there is no information about the graph in $B_\lambda$) and
when $\lambda = 1$ the estimates agree perfectly. The interesting part occurs in the
range $\lambda \in [0.3, 0.95]$. As long as there is some information from the graph,
averaging in this noise improves the estimate. Whether this is because the
binary representation of the probabilities from the adjacency matrix does not
give the algorithm sufficient flexibility to fit the model (all those 0 probability
estimates that should not be 0) or for some other reason is an area for future
research.
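
The mixing experiment can be set up along the following lines. Several details are assumptions made for the sketch: "uniform in the simplex" is taken to mean a flat Dirichlet draw truncated to its first $d$ components, the raw adjacency matrix is used without diagonal augmentation, and the seed and grid of $\lambda$ values are arbitrary. No particular error values are claimed.

```python
import numpy as np

def spectral_embed(M, d):
    """Scale the eigenvectors of the d largest eigenvalues by their square roots."""
    vals, vecs = np.linalg.eigh(M)
    top = np.argsort(vals)[::-1][:d]
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))

def mixed_estimate(A, Y, lam, d):
    """Spectral estimate from B_lam = lam * A + (1 - lam) * Y Y^T."""
    B_lam = lam * A + (1 - lam) * (Y @ Y.T)
    return spectral_embed(B_lam, d)

def fit_error(X, Z):
    """Squared Frobenius error between true and estimated edge probabilities."""
    return np.linalg.norm(X @ X.T - Z @ Z.T, "fro") ** 2

rng = np.random.default_rng(3)
n, d = 100, 3
X = rng.dirichlet(np.ones(d + 1), size=n)[:, :d]   # latent positions
Y = rng.dirichlet(np.ones(d + 1), size=n)[:, :d]   # independent of X: pure noise
P = X @ X.T
upper = np.triu(rng.random((n, n)) < P, k=1).astype(float)
A = upper + upper.T                                # adjacency matrix of the graph

errors = {lam: fit_error(X, mixed_estimate(A, Y, lam, d))
          for lam in (0.0, 0.3, 0.6, 0.95, 1.0)}
graph_only = spectral_embed(A, d)
```

At $\lambda = 1$ the mixed estimate coincides with the graph-only estimate, and at $\lambda = 0$ it reproduces $YY^T$ and carries no information about the graph; repeating the draw many times over a finer grid would give box-plots of the kind described for Figure 2.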
6  Discussion

We  have  described  a  method  for  combining  information  from  two  sensors,  and 
the  method  easily  extends  to  multiple  sensors.  We  discussed  the  situations  in 
which  the  method  does  not  work,  at  least  not  without  some  modifications  or 
significant work. This discussion links a measure of comparability of the sensors
(the commensurability of the embedded points) to the ultimate performance
of  the  inference,  and  thus  provides  a  useful  method  for  diagnosing  when  the 
algorithm  is  likely  to  be  applicable  and  when  not. 

The  joint  embedding  method  is  clearly  worth  considering  when  the  data  have 
a  natural  dissimilarity  function  available,  or  when  the  data  come  in  the  form 
of a dissimilarity matrix (which is often the case in psychological experiments
and in some brain mapping experiments). When the data are presented as
features,  other  methods  of  fusion  immediately  suggest  themselves  and  should 
not  be  ignored.  Simply  forming  the  product  (appending  all  the  features  together 
into  one  long  vector)  and  then  performing  feature  selection  and  dimensionality 
reduction  is  a  well-used  and  time-honored  approach,  and  should  always  be  a 
part  of  our  toolkit. 

The  random  graph  experiment  showed  some  surprising  results.  Adding  noise 
to  the  graph  can  improve  estimation  using  the  spectral  approach,  and  the  joint 
embedding  provides  a  natural  way  to  take  advantage  of  this.  Further  research 
is  clearly  suggested,  and  this  will  be  one  of  the  major  areas  in  which  we  will  be 
involved  in  the  future. 